19 research outputs found

    Robust Lasso-Zero for sparse corruption and model selection with missing covariates

    Full text link
    We propose Robust Lasso-Zero, an extension of the Lasso-Zero methodology [Descloux and Sardy, 2018], initially introduced for sparse linear models, to the sparse corruptions problem. We give theoretical guarantees on sign recovery of the parameters for a slightly simplified version of the estimator, called Thresholded Justice Pursuit. The use of Robust Lasso-Zero is showcased for variable selection with missing values in the covariates. In addition to not requiring the specification of a model for the covariates, nor estimation of their covariance matrix or of the noise variance, the method has the great advantage of handling missing not at random (MNAR) values without specifying a parametric model. Numerical experiments and a medical application underline the relevance of Robust Lasso-Zero in such a context, where few competitors are available. The method is easy to use and implemented in the R library lass0.
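    The "justice pursuit" idea behind the estimator can be sketched in a few lines: jointly estimate a sparse coefficient vector and sparse corruptions by running a Lasso on the design matrix augmented with an identity block, then threshold. This is an illustrative simplification in Python (using scikit-learn), not the lass0 implementation, and the data, threshold, and penalty level are hypothetical choices.

```python
# Illustrative sketch of a justice-pursuit-style robust Lasso:
# the corruption of observation i is modeled as an extra coefficient on the
# i-th column of an identity block appended to the design matrix.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 8
beta = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 1.5])
X = rng.standard_normal((n, p))
y = X @ beta + 0.05 * rng.standard_normal(n)
y[:5] += 10.0                      # sparse corruptions on 5 observations

A = np.hstack([X, np.eye(n)])      # augmented design: [coefficients | corruptions]
fit = Lasso(alpha=0.05, fit_intercept=False, max_iter=10_000).fit(A, y)
beta_hat = fit.coef_[:p]           # first p entries estimate beta
corr_hat = fit.coef_[p:]           # remaining entries estimate the corruptions

support = np.where(np.abs(beta_hat) > 0.5)[0]   # thresholding step
print(support)
```

    The thresholding step mirrors the "Thresholded" part of Thresholded Justice Pursuit: small spurious coefficients are zeroed out, keeping only the strong signals.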

    R-miss-tastic: a unified platform for missing values methods and workflows

    Full text link
    Missing values are unavoidable when working with data. Their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), there has been a burgeoning literature on missing values with heterogeneous aims and motivations. This has resulted in the development of various methods, formalizations, and tools (including a large number of R packages and Python modules). For practitioners, however, it remains challenging to decide which method is most suited to their problem, partially because handling missing data is still not a topic systematically covered in statistics or data science curricula. To help address this challenge, we have launched a unified platform, "R-miss-tastic", which aims to provide an overview of standard missing values problems, methods, how to handle them in analyses, and relevant implementations of methodologies. In the same perspective, we have also developed several pipelines in R and Python to allow for a hands-on illustration of how to handle missing values in various statistical tasks such as estimation and prediction, while ensuring reproducibility of the analyses. This will hopefully also provide some guidance for deciding which method to choose for a specific problem and data. The objective of this work is not only to comprehensively organize materials, but also to create standardized analysis workflows and to provide a common ground for discussions in the community. The platform is thus suited for beginners, students, more advanced analysts, and researchers.

    Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism

    Full text link
    Semi-supervised learning (SSL) is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others. In the missing-data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing data scenarios.
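    The core of inverse propensity weighting can be illustrated on a toy labeling mechanism. In this hedged sketch the class-dependent labeling propensities are assumed known (the paper estimates them); the numbers below are arbitrary illustrative choices.

```python
# Sketch of inverse propensity weighting (IPW) with "informative" labels:
# class 1 is labeled less often than class 0, so statistics computed naively
# on labeled points are biased; reweighting by 1/propensity removes the bias.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.integers(0, 2, size=n)           # balanced classes: P(y=1) = 0.5
prop = np.where(y == 0, 0.9, 0.3)        # labeling propensity depends on the class
labeled = rng.random(n) < prop           # MNAR labels: missingness depends on y

naive = y[labeled].mean()                # biased estimate of P(y=1)
w = 1.0 / prop[labeled]
ipw = (w * y[labeled]).sum() / w.sum()   # self-normalized IPW estimate

print(round(naive, 3), round(ipw, 3))
```

    The naive estimate concentrates around 0.25 (the labeled points over-represent class 0), while the reweighted estimate recovers the true class balance of 0.5. The same reweighting applied to a loss function debiases an SSL training objective.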

    Model-based Clustering with Missing Not At Random Data

    Full text link
    Traditional ways of handling missing values are not designed for clustering, and they rarely apply to the general case, though frequent in practice, of Missing Not At Random (MNAR) values. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical, and mixed) to jointly model the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, which may depend on the underlying (unknown) classes and/or on the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available at https://github.com/AudeSportisse/Clustering-MNAR. We also prove that MNAR models in which the missingness depends on the class membership have the nice property that statistical inference can be carried out on the data matrix concatenated with the mask, by considering a MAR mechanism instead. Finally, we perform empirical evaluations of the proposed sub-models on synthetic data and illustrate the relevance of our method on a medical register, the TraumaBase® dataset.

    Debiasing Stochastic Gradient Descent to handle missing values

    Get PDF
    The stochastic gradient algorithm is a key ingredient of many machine learning methods and is particularly appropriate for large-scale learning. However, a major caveat of large data is their incompleteness. We propose an averaged stochastic gradient algorithm handling missing values in linear models. This approach has the merit of requiring no modeling of the data distribution and of accounting for heterogeneous missing proportions. In both streaming and finite-sample settings, we prove that this algorithm achieves a convergence rate of O(1/n) at iteration n, the same as without missing values. We show the convergence behavior and the relevance of the algorithm not only on synthetic data but also on real data sets, including data collected from a medical register.
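    The debiasing idea can be sketched for least squares with covariates imputed by zero. Under a hypothetical homogeneous MCAR mechanism with known observation probability p, E[x̃x̃ᵀ] = p²xxᵀ + p(1-p)diag(x²), so the naive gradient is biased and can be corrected in closed form. This is a simplified check of unbiasedness, not the paper's full averaged algorithm (which also handles heterogeneous missing proportions).

```python
# Monte Carlo check that the corrected gradient is unbiased for the
# full-data least-squares gradient x (x^T beta - y).
import numpy as np

rng = np.random.default_rng(0)
p_obs = 0.7
x = rng.standard_normal(5)
beta = rng.standard_normal(5)
y = x @ beta + 0.1

def debiased_grad(x_tilde, y, beta, p):
    """Unbiased estimate of x (x^T beta - y) from the zero-imputed x_tilde."""
    g = (x_tilde * (x_tilde @ beta)) / p**2
    g -= (1 - p) / p**2 * (x_tilde**2) * beta   # remove the diagonal bias term
    g -= x_tilde * y / p
    return g

true_grad = x * (x @ beta - y)
draws = np.array([
    debiased_grad(x * (rng.random(5) < p_obs), y, beta, p_obs)
    for _ in range(200_000)
])
dev = np.abs(draws.mean(axis=0) - true_grad).max()
print(dev)
```

    Averaging these corrected gradients over many random masks matches the full-data gradient, which is the property the convergence analysis builds on.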

    Handling MNAR and heterogeneous missing data in various statistical learning scenarios: imputation based on low-rank models, online linear regression with a stochastic gradient descent algorithm, and clustering with mixture models

    No full text
    The statistical analysis of growing masses of data represents real added value for numerous and varied applications. Nevertheless, one of the ironies of increased data collection is that missing data are unavoidable: the more data there are, the more missing data there are. The goal of this PhD thesis is to propose new statistical methods to handle missing values in several supervised and unsupervised machine learning scenarios, particularly when the data can be Missing Not At Random (MNAR), i.e. when the unavailability of values depends on the missing values themselves and on the values of other variables. Particular attention has been paid to deriving methods with strong theoretical and practical foundations, meeting concrete needs in applications. First, low-rank models with either fixed or random effects are studied when MNAR values can occur on several variables. Second, we address the case of online linear regression with missing covariates using a debiased averaged stochastic gradient algorithm. Furthermore, we investigate model-based clustering with MNAR data. Finally, we present our collaborative platform for reproducible research on missing values processing, which bundles classical and state-of-the-art methods.

    Imputation and low-rank estimation with Missing Not At Random data

    Get PDF
    Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion based on low-rank assumptions is a very powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values, which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is a model-based estimation strategy that models the distribution of the missingness mechanism. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is a computationally efficient surrogate estimation that implicitly takes into account the joint distribution of the data and the missingness mechanism: the data matrix is concatenated with the mask coding for the missing values, and a low-rank structure for exponential families is assumed on this new matrix in order to encode links between variables and missingness mechanisms. The methodology, which has the great advantage of handling different missing-value mechanisms, is robust to model specification errors. The performance of our methods is assessed on real data collected from a trauma registry (TraumaBase®) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is to predict whether doctors should administer tranexamic acid to patients with traumatic brain injury, which would limit excessive bleeding.
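    The surrogate approach can be sketched in a few lines: concatenate the zero-imputed data matrix with its missingness mask and apply singular value thresholding, the proximal step of the nuclear norm used inside FISTA-type solvers. Gaussian data replaces the paper's exponential-family model here, and a single proximal step replaces the full iteration; the threshold value is an arbitrary illustrative choice.

```python
# One singular-value-thresholding step on the data matrix concatenated with
# its mask -- the building block of nuclear-norm (low-rank) solvers.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 200, 10, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))  # rank-2 signal
miss = rng.random((n, d)) < 0.2
X_obs = np.where(miss, 0.0, X)               # zero-impute the missing entries

M = np.hstack([X_obs, miss.astype(float)])   # data concatenated with the mask

def svt(A, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return U @ (s[:, None] * Vt), int((s > 0).sum())

M_lr, rank = svt(M, tau=5.0)
print(rank)
```

    Shrinking the singular values of the concatenated matrix encodes low-rank links between the variables and the missingness mechanism, which is exactly what the surrogate estimation exploits.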

    Estimation and imputation in Probabilistic Principal Component Analysis with Missing Not At Random data

    No full text
    Missing Not At Random (MNAR) values lead to significant biases in the data, since the probability of missingness depends on the unobserved values. They are "not ignorable" in the sense that they often require defining a model for the missing-data mechanism, which makes inference or imputation tasks more complex. Furthermore, this implies a strong a priori on the parametric form of the distribution. However, some works have obtained guarantees on the estimation of parameters in the presence of MNAR data without specifying the distribution of the missing data (Mohan, 2018; Tang, 2003). This is very useful in practice but is limited to simple cases, such as self-masked MNAR values in data generated according to linear regression models. We continue this line of research, but extend it to a more general MNAR mechanism, in a more general model: probabilistic principal component analysis (PPCA), i.e., a low-rank model with random effects. We prove identifiability of the PPCA parameters. We then propose an estimator of the loading coefficients and a data imputation method. They are based on estimators of the means, variances, and covariances of the missing variables, for which consistency is discussed. These estimators have the great advantage of being calculated using only the observed data, leveraging the underlying low-rank structure of the data. We illustrate the relevance of the method with numerical experiments on synthetic data and on real data collected from a medical register.
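    The idea of recovering loadings from moments computed on observed entries only can be sketched with "available case" estimates. MCAR missingness is assumed in this toy example for simplicity; the paper's estimators handle a general MNAR mechanism and come with consistency guarantees.

```python
# Toy sketch: pairwise covariances from observed entries only, then the
# leading eigenvector recovers the loading direction of a rank-one PPCA model.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 5
w = np.array([2.0, 1.0, 0.0, -1.0, 0.5])          # true loading (rank one)
X = rng.standard_normal(n)[:, None] * w + 0.3 * rng.standard_normal((n, d))
miss = rng.random((n, d)) < 0.3
Xm = np.where(miss, np.nan, X)

# Pairwise covariance: each entry uses the rows where both coordinates are seen.
C = np.empty((d, d))
for i in range(d):
    for j in range(d):
        ok = ~np.isnan(Xm[:, i]) & ~np.isnan(Xm[:, j])
        C[i, j] = np.cov(Xm[ok, i], Xm[ok, j])[0, 1]

eigvals, eigvecs = np.linalg.eigh(C)
w_hat = eigvecs[:, -1]                             # leading eigenvector
cosine = abs(w_hat @ w) / np.linalg.norm(w)
print(round(cosine, 3))
```

    Since the population covariance is wwᵀ + σ²I, its leading eigenvector is w up to scale, and the available-case estimate recovers it without imputing anything first.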

    R-miss-tastic: a unified platform for missing values methods and workflows

    No full text
    Missing values are unavoidable when working with data. Their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), a burgeoning literature on missing values has arisen, with heterogeneous aims and motivations. This has led to the development of various methods, formalizations, and tools. For practitioners, it nevertheless remains challenging to decide which method is most suited to their problem, partially due to a lack of systematic coverage of this topic in statistics or data science curricula. To help address this challenge, we have launched the "R-miss-tastic" platform, which aims to provide an overview of standard missing values problems, methods, and relevant implementations of methodologies. Beyond gathering and organizing a large majority of the material on missing data (bibliography, courses, tutorials, implementations), "R-miss-tastic" covers the development of standardized analysis workflows. Indeed, we have developed several pipelines in R and Python to allow for hands-on illustration of, and recommendations on, missing values handling in various statistical tasks such as matrix completion, estimation, and prediction, while ensuring reproducibility of the analyses. Finally, the platform is dedicated to users who analyze incomplete data, researchers who want to compare their methods and search for an up-to-date bibliography, and teachers who are looking for didactic materials (notebooks, videos, slides).
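    As an example of the kind of workflow such pipelines illustrate, an imputer can be chained with a predictor so that imputation parameters are learned on training folds only. This is hypothetical code, not taken from R-miss-tastic, using standard scikit-learn components.

```python
# A simple prediction-with-missing-values workflow: impute inside a pipeline
# so the imputer's statistics are fitted on the training split only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 500, 4
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(n)
X[rng.random((n, d)) < 0.2] = np.nan               # 20% of values missing

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(SimpleImputer(strategy="mean"), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)                    # R^2 on held-out data
print(round(score, 2))
```

    Wrapping the imputer in the pipeline avoids leaking test-set information into the imputed values, which is one of the reproducibility pitfalls such standardized workflows guard against.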